Olist E-commerce Customer Segmentation Project¶

Context¶

This project involves analyzing Olist's e-commerce data to develop customer segmentation and key performance indicators (KPIs) for a Customer Experience Dashboard. The goal is to extract actionable insights for marketing campaigns and customer behavior analysis.

Project Objectives¶

  1. Implement SQL queries to extract data for the Customer Experience Dashboard.
  2. Develop customer segmentation to support targeted marketing campaigns.
  3. Analyze customer behavior patterns to identify trends and opportunities.
In [1]:
# Configure Plotly to properly render in HTML exports
import plotly.io as pio
import plotly.graph_objects as go
import os

# Set the renderer for notebook display
pio.renderers.default = "notebook"

# Enable figure exports with all necessary dependencies embedded
#pio.write_html_config.include_plotlyjs = 'cdn'
#pio.write_html_config.include_mathjax = 'cdn'
#pio.write_html_config.full_html = True

# Configure global theme for consistent appearance
pio.templates.default = "plotly_white"

1. Data Exploration¶

1.1 Database Connection Setup¶

  • Objective: Establish a connection to the SQLite database and verify the available tables.
  • Steps:
    1. Initialize the database connection.
    2. List all available tables in the dataset.
    3. Preview the first few rows of each table to understand its structure.
In [2]:
from src.scripts.sqlite_connector import DatabaseConnection
import os

# Get current working directory
cwd = os.getcwd()

# Initialize database connection
db_path = os.path.join(cwd, 'dataset', 'olist.db')
db = DatabaseConnection(db_path)

# Get all table names
tables = db.get_table_names()
print("Available tables:", tables)

# Read specific table
for table in tables:
    orders_df = db.read_table(table)
    display(orders_df.head())
Available tables: ['customers', 'geoloc', 'order_items', 'order_pymts', 'order_reviews', 'orders', 'products', 'sellers', 'translation']
index customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP
1 1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 9790 sao bernardo do campo SP
2 2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 1151 sao paulo SP
3 3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 8775 mogi das cruzes SP
4 4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP
index geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 0 1037 -23.545621 -46.639292 sao paulo SP
1 1 1046 -23.546081 -46.644820 sao paulo SP
2 2 1046 -23.546129 -46.642951 sao paulo SP
3 3 1041 -23.544392 -46.639499 sao paulo SP
4 4 1035 -23.541578 -46.641607 sao paulo SP
index order_id order_item_id product_id seller_id shipping_limit_date price freight_value
0 0 00010242fe8c5a6d1ba2dd792cb16214 1 4244733e06e7ecb4970a6e2683c13e61 48436dade18ac8b2bce089ec2a041202 2017-09-19 09:45:35 58.90 13.29
1 1 00018f77f2f0320c557190d7a144bdd3 1 e5f2d52b802189ee658865ca93d83a8f dd7ddc04e1b6c2c614352b383efe2d36 2017-05-03 11:05:13 239.90 19.93
2 2 000229ec398224ef6ca0657da4fc703e 1 c777355d18b72b67abbeef9df44fd0fd 5b51032eddd242adc84c38acab88f23d 2018-01-18 14:48:30 199.00 17.87
3 3 00024acbcdf0a6daa1e931b038114c75 1 7634da152a4610f1595efa32f14722fc 9d7a1d34a5052409006425275ba1c2b4 2018-08-15 10:10:18 12.99 12.79
4 4 00042b26cf59d7ce69dfabb4e55b4fd9 1 ac6c3623068f30de03045865e4e10089 df560393f3a51e74553ab94004ba5c87 2017-02-13 13:57:51 199.90 18.14
index order_id payment_sequential payment_type payment_installments payment_value
0 0 b81ef226f3fe1789b1e8b2acac839d17 1 credit_card 8 99.33
1 1 a9810da82917af2d9aefd1278f1dcfa0 1 credit_card 1 24.39
2 2 25e8ea4e93396b6fa0d3dd708e76c1bd 1 credit_card 1 65.71
3 3 ba78997921bbcdc1373bb41e913ab953 1 credit_card 8 107.78
4 4 42fdf880ba16b47b59251dd489d4441a 1 credit_card 2 128.45
index review_id order_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp
0 0 7bc2406110b926393aa56f80a40eba40 73fc7af87114b39712e6da79b0a377eb 4 None None 2018-01-18 00:00:00 2018-01-18 21:46:59
1 1 80e641a11e56f04c1ad469d5645fdfde a548910a1c6147796b98fdf73dbeba33 5 None None 2018-03-10 00:00:00 2018-03-11 03:05:13
2 2 228ce5500dc1d8e020d8d1322874b6f0 f9e4b658b201a9f2ecdecbb34bed034b 5 None None 2018-02-17 00:00:00 2018-02-18 14:36:24
3 3 e64fb393e7b32834bb789ff8bb30750e 658677c97b385a9be170737859d3511b 5 None Recebi bem antes do prazo estipulado. 2017-04-21 00:00:00 2017-04-21 22:02:06
4 4 f7c4243c7fe1938f181bec41a392bdeb 8e6bfb81e283fa7e4f11123a3fb894f1 5 None Parabéns lojas lannister adorei comprar pela I... 2018-03-01 00:00:00 2018-03-02 10:26:53
index order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00
1 1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00
2 2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00
3 3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00
4 4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00
index product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
0 0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.0 287.0 1.0 225.0 16.0 10.0 14.0
1 1 3aa071139cb16b67ca9e5dea641aaa2f artes 44.0 276.0 1.0 1000.0 30.0 18.0 20.0
2 2 96bd76ec8810374ed1b65e291975717f esporte_lazer 46.0 250.0 1.0 154.0 18.0 9.0 15.0
3 3 cef67bcfe19066a932b7673e239eb23d bebes 27.0 261.0 1.0 371.0 26.0 4.0 26.0
4 4 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37.0 402.0 4.0 625.0 20.0 17.0 13.0
index seller_id seller_zip_code_prefix seller_city seller_state
0 0 3442f8959a84dea7ee197c632cb2df15 13023 campinas SP
1 1 d1b65fc7debc3361ea86b5f14c68d2e2 13844 mogi guacu SP
2 2 ce3ad9de960102d0677a81f5d0bb7b2d 20031 rio de janeiro RJ
3 3 c0f3eea2e14555b6faeea3dd58c1b1c3 4195 sao paulo SP
4 4 51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
index product_category_name product_category_name_english
0 0 beleza_saude health_beauty
1 1 informatica_acessorios computers_accessories
2 2 automotivo auto
3 3 cama_mesa_banho bed_bath_table
4 4 moveis_decoracao furniture_decor

1.2 Data Overview¶

  • Objective: Understand the contents of the database.
  • Key Tables:
    • Orders: Contains order history and delivery details.
    • Customers: Includes customer demographic and geographical data.
    • Products: Provides product details and categories.
    • Reviews: Contains customer satisfaction reviews.
    • Sellers: Includes seller information and performance metrics.

1.3 Query 1: Late Deliveries Analysis¶

  • Objective: Identify recent orders (less than 3 months old) that were delivered with a delay of at least 3 days, excluding canceled orders.
  • Insights:
    • Helps assess delivery performance and identify areas for improvement.
In [3]:
query_late_deliveries = """
    WITH delivery_stats AS (
        SELECT MAX(order_purchase_timestamp) as latest_estimate
        FROM orders
        WHERE order_status != 'canceled'
    ),
    delayed_deliveries AS (
        SELECT o.*
        FROM orders o, delivery_stats d
        WHERE 
            o.order_estimated_delivery_date IS NOT NULL AND o.order_delivered_customer_date IS NOT NULL
            AND o.order_status != 'canceled'
            AND o.order_delivered_customer_date > date(o.order_estimated_delivery_date, '+3 days')
            AND o.order_purchase_timestamp BETWEEN 
                date(d.latest_estimate, '-3 months') 
                AND d.latest_estimate
    )
    SELECT * FROM delayed_deliveries;
    """
df_late_deliveries = db.execute_query(query_late_deliveries)
display(df_late_deliveries)
index order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 561 cfa4fa27b417971e86d8127cb688712f 7093250e1741ebbed41f0cc552025fd6 delivered 2018-08-16 09:44:23 2018-08-17 03:10:22 2018-08-17 18:55:00 2018-08-29 01:41:41 2018-08-22 00:00:00
1 711 234c056c50619f48da64f731c48242b4 44e460a655f7154ccd9faa4dbbbaf68a delivered 2018-08-14 14:49:15 2018-08-15 03:15:15 2018-08-31 15:25:00 2018-09-01 18:14:42 2018-08-23 00:00:00
2 728 8ad3f1d0f96992e43566c4c82c9f6c58 948b29e24216a05fea13a18d8db45ea5 delivered 2018-07-17 21:25:29 2018-07-17 21:35:17 2018-07-18 13:08:00 2018-08-14 04:04:40 2018-08-03 00:00:00
3 1335 43f9a3d3e9d95525bfce717e31f72a56 b003d09f32a12bcc00b6ca04d46554e6 delivered 2018-07-11 00:13:33 2018-07-12 03:10:16 2018-07-27 16:27:00 2018-08-04 16:15:46 2018-07-30 00:00:00
4 1844 7f579e203c931f3e8410103359c6d523 d665be250d1c687c58fdea61a9b55a58 delivered 2018-08-02 18:09:27 2018-08-02 18:43:43 2018-08-10 14:15:00 2018-08-13 20:11:47 2018-08-09 00:00:00
... ... ... ... ... ... ... ... ... ...
439 98368 874a7690bc049bd4ce210d195bdfff7b f6b2c83ea477af36f44a1e2df7320036 delivered 2018-08-02 12:06:47 2018-08-02 13:05:56 2018-08-10 14:42:00 2018-08-13 15:50:48 2018-08-08 00:00:00
440 98382 587e32dd528769d669d539531d32aeb5 0af8a25fb0b6f833724bb5818275807b delivered 2018-08-10 11:46:09 2018-08-11 02:50:25 2018-08-14 10:09:00 2018-09-03 09:32:31 2018-08-28 00:00:00
441 98564 3adb141ba4bd69dd7fe8d3fb733c6b74 c0539d5c87fc7c97a8418adffe4b45f0 delivered 2018-08-14 23:29:21 2018-08-16 03:05:11 2018-08-16 13:28:00 2018-08-28 18:02:52 2018-08-24 00:00:00
442 99279 4cf09d9e5ebbe0f91ddd7bf9aae891cd 07b6b4fe5fefb948fc76b6d2bdba77d8 delivered 2018-07-19 08:37:26 2018-07-21 03:25:17 2018-07-23 15:31:00 2018-08-21 01:12:45 2018-08-10 00:00:00
443 99396 b3112ca67f3afd4e20cf2ee91fc4f804 6f83c71b6c044fb156d697d4130fe9b5 delivered 2018-08-02 22:46:54 2018-08-02 23:04:06 2018-08-15 17:42:00 2018-08-21 00:03:26 2018-08-16 00:00:00

444 rows × 9 columns

In [4]:
query_last_6months_deliveries = """
    WITH delivery_stats AS (
        SELECT MAX(order_purchase_timestamp) as latest_estimate
        FROM orders
        WHERE order_status != 'canceled'
    ),
    delayed_deliveries AS (
        SELECT o.*
        FROM orders o, delivery_stats d
        WHERE 
            o.order_estimated_delivery_date IS NOT NULL AND o.order_delivered_customer_date IS NOT NULL
            AND o.order_status != 'canceled'
            AND o.order_purchase_timestamp BETWEEN 
                date(d.latest_estimate, '-6 months') 
                AND d.latest_estimate
    )
    SELECT * FROM delayed_deliveries;
    """
df_deliveries_6months = db.execute_query(query_last_6months_deliveries)
display(df_deliveries_6months)
index order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00
1 2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00
2 12 82566a660a982b15fb86e904c8d32918 d3e3b74c766bc6214e0c830b17ee2341 delivered 2018-06-07 10:06:19 2018-06-09 03:13:12 2018-06-11 13:29:00 2018-06-19 12:05:52 2018-07-18 00:00:00
3 13 5ff96c15d0b717ac6ad1f3d77225a350 19402a48fe860416adf93348aba37740 delivered 2018-07-25 17:44:10 2018-07-25 17:55:14 2018-07-26 13:16:00 2018-07-30 15:52:25 2018-08-08 00:00:00
4 15 dcb36b511fcac050b97cd5c05de84dc3 3b6828a50ffe546942b7a473d70ac0fc delivered 2018-06-07 19:03:12 2018-06-12 23:31:02 2018-06-11 14:54:00 2018-06-21 15:34:32 2018-07-04 00:00:00
... ... ... ... ... ... ... ... ... ...
38618 99425 d692ef54145c9cb3322ec2e5508aa3f4 82ddfcf9438b0cd1117b55ac33184df8 delivered 2018-03-21 19:47:18 2018-03-21 20:05:26 2018-03-22 21:11:58 2018-04-11 00:48:31 2018-04-09 00:00:00
38619 99427 19dba6d1e58b04e7fc820558a38ea398 cf97e9cf454480b303de6736ddd15fd6 delivered 2018-05-12 13:42:24 2018-05-12 13:55:17 2018-05-16 08:14:00 2018-05-18 17:18:29 2018-06-04 00:00:00
38620 99428 c22a47117b6a87c967b0c278488110c1 6e153567dc52ea59c0498ef5383e9c7a delivered 2018-06-22 20:53:29 2018-06-22 21:17:45 2018-06-25 17:50:00 2018-07-03 19:48:54 2018-07-23 00:00:00
38621 99430 c81f74e50f0496fa39716cc77cacd460 7b270ebc87c25c8404348c10ff80a80e delivered 2018-03-04 22:48:38 2018-03-04 23:09:22 2018-03-05 16:15:04 2018-03-13 18:54:32 2018-03-26 00:00:00
38622 99440 66dea50a8b16d9b4dee7af250b4be1a5 edb027a75a1449115f6b43211ae02a24 delivered 2018-03-08 20:57:30 2018-03-09 11:20:28 2018-03-09 22:11:59 2018-03-16 13:08:30 2018-04-03 00:00:00

38623 rows × 9 columns

In [5]:
from src.scripts.visualize_late_deliveries import visualize_late_deliveries

# Create visualization for late deliveries
fig = visualize_late_deliveries(df_deliveries_6months)
fig.show()

1.4 Query 2: High Revenue Sellers¶

  • Objective: Identify sellers who generated over 100,000 BRL in revenue from delivered orders.
  • Insights:
    • Highlights top-performing sellers for potential partnerships or incentives.
In [6]:
query_high_revenue = """
SELECT 
    s.seller_id,
    CAST(SUM(oi.price + oi.freight_value) AS INTEGER) as total_revenue
FROM sellers s
    JOIN order_items oi ON s.seller_id = oi.seller_id
    JOIN orders o ON oi.order_id = o.order_id WHERE o.order_status = 'delivered'
GROUP BY s.seller_id
HAVING total_revenue > 100000
ORDER BY total_revenue DESC;
"""
df_high_revenue = db.execute_query(query_high_revenue)
display(df_high_revenue)
seller_id total_revenue
0 4869f7a5dfa277a7dca6462dcf3b52b2 247007
1 7c67e1448b00f6e969d365cea6b010ab 237806
2 4a3ca9315b744ce9f8e9374361493884 231220
3 53243585a1d6dc2643021fd1853d8905 230797
4 fa1c13f2614d7b5c4749cbc52fecda94 200833
5 da8622b14eb17ae2831f4ac5b9dab84a 184706
6 7e93a43ef30c4f03f38b393420bc753a 171973
7 1025f0e2d44d7041d6cf58b6550e0bfa 171924
8 7a67c85e85bb2ce8582c35f2203ad736 160278
9 955fee9216a65b617aa5c0531780ce60 156606
10 6560211a19b47992c3666cc44a7e94c0 148050
11 1f50f920176fa81dab994f9023523100 141712
12 46dc3b2cc0980fb8ec44634e21d2718e 134162
13 a1043bafd471dff536d0c462352beb48 130412
14 620c87c171fb2a6dd6e8bb4dec959fc6 126278
15 cc419e0650a3c5ba77189a1882b7556a 125936
16 5dceca129747e92ff8ef7a997dc4f8ca 124702
17 7d13fca15225358621be4086e1eb0964 120934
18 3d871de0142ce09b7081e2b9d1733cb1 115515
In [7]:
from src.scripts.visualize_top_sellers import visualize_top_sellers

# Create visualization for high revenue sellers
fig = visualize_top_sellers(df_high_revenue)
fig.show()

1.5 Query 3: Engaged New Sellers¶

  • Objective: Identify new sellers (active for less than 3 months) who have sold more than 30 products.
  • Insights:
    • Helps track the onboarding success of new sellers.
In [8]:
query_new_engaged_sellers = """
WITH seller_stats AS (
    SELECT 
        s.seller_id,
        MIN(o.order_purchase_timestamp) as first_sale,
        COUNT(DISTINCT oi.order_id) as total_orders,
        COUNT(oi.product_id) as total_products
    FROM sellers s
        JOIN order_items oi ON s.seller_id = oi.seller_id
        JOIN orders o ON oi.order_id = o.order_id
    GROUP BY s.seller_id
),
latest_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) as latest_purchase
    FROM orders
)

SELECT s.*
FROM seller_stats s, latest_purchase_date l
    WHERE s.total_products > 30
    AND s.first_sale >= date(l.latest_purchase, '-3 months')
ORDER BY s.total_products DESC;
"""
df_new_engaged_sellers = db.execute_query(query_new_engaged_sellers)
display(df_new_engaged_sellers)
seller_id first_sale total_orders total_products
0 d13e50eaa47b4cbe9eb81465865d8cfc 2018-08-04 09:09:37 67 69
1 81f89e42267213cb94da7ddc301651da 2018-08-08 12:45:12 46 52
2 240b9776d844d37535668549a396af32 2018-07-17 13:48:59 36 36

1.6 Query 4: Worst Reviewed Postal Codes¶

  • Objective: Identify the top 5 postal codes with the worst average review scores (minimum 30 reviews) in the last 12 months.
  • Insights:
    • Pinpoints geographical areas with potential service or product quality issues.
In [9]:
query_worst_postcodes = """

WITH latest_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) as latest_purchase
    FROM orders
)

SELECT 
    c.customer_zip_code_prefix,
    COUNT(r.review_id) as review_count,
    ROUND(AVG(r.review_score), 2) as avg_score
FROM customers c, latest_purchase_date l
    JOIN orders o ON c.customer_id = o.customer_id
    JOIN order_reviews r ON o.order_id = r.order_id
WHERE r.review_creation_date >= date(l.latest_purchase, '-12 months')
GROUP BY c.customer_zip_code_prefix
HAVING review_count >= 30
ORDER BY avg_score ASC
LIMIT 5;
"""
df_worst_postcodes= db.execute_query(query_worst_postcodes)
display(df_worst_postcodes)
customer_zip_code_prefix review_count avg_score
0 22753 48 2.83
1 22770 38 3.18
2 22793 93 3.25
3 13056 31 3.29
4 13295 32 3.31
In [10]:
query_all_postcodes = """

WITH latest_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) as latest_purchase
    FROM orders
)

SELECT 
    c.customer_zip_code_prefix,
    COUNT(r.review_id) as review_count,
    ROUND(AVG(r.review_score), 2) as avg_score
FROM customers c, latest_purchase_date l
    JOIN orders o ON c.customer_id = o.customer_id
    JOIN order_reviews r ON o.order_id = r.order_id
WHERE r.review_creation_date >= date(l.latest_purchase, '-12 months')
GROUP BY c.customer_zip_code_prefix
HAVING review_count >= 1
ORDER BY avg_score ASC;

"""
df_all_postcodes= db.execute_query(query_all_postcodes)
In [11]:
from src.scripts.visualize_customer_review_bubble import create_brazil_postcode_map
# Create and display the map
fig = create_brazil_postcode_map(df_all_postcodes)
fig.show()

2. Data Extraction and Feature Engineering¶

2.1 RFM Features¶

  • Objective: Calculate Recency, Frequency, and Monetary (RFM) metrics for each customer.
    • Recency: Time since the last purchase.
    • Frequency: Number of orders placed.
    • Monetary: Total spending by the customer.
  • Insights:
    • RFM metrics are foundational for customer segmentation.
In [12]:
query_rfm = """
WITH last_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) as max_date
    FROM orders
    WHERE order_status = 'delivered'
)
SELECT 
    c.customer_id,
    MAX(o.order_purchase_timestamp) as order_purchase_timestamp,
    -- Recency
    CAST(JULIANDAY(l.max_date) - JULIANDAY(MAX(o.order_purchase_timestamp)) AS INTERGER) as recency_days,
    -- Frequency
    COUNT(o.order_id) as frequency,
    -- Monetary
    CAST(SUM(oi.price + oi.freight_value) AS INTEGER) as monetary
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
CROSS JOIN last_purchase_date l
WHERE o.order_status = 'delivered'
GROUP BY c.customer_id;
"""
df_rfm= db.execute_query(query_rfm)
display(df_rfm)
customer_id order_purchase_timestamp recency_days frequency monetary
0 00012a2ce6f8dcda20d059ce98491703 2017-11-14 16:08:26 287 1 114
1 000161a058600d5901f007fab4c27140 2017-07-16 09:40:32 409 1 67
2 0001fd6190edaaf884bcaf3d49edf079 2017-02-28 11:06:43 547 1 195
3 0002414f95344307404f0ace7a26f1d5 2017-08-16 13:09:20 378 1 179
4 000379cdec625522490c315e70c7a9fb 2018-04-02 13:42:17 149 1 107
... ... ... ... ... ...
96473 fffcb937e9dd47a13f05ecb8290f4d3e 2018-03-17 00:55:27 165 1 91
96474 fffecc9f79fd8c764f843e9951b11341 2018-03-29 16:59:26 152 1 81
96475 fffeda5b6d849fbd39689bb92087f431 2018-05-22 13:36:02 99 1 63
96476 ffff42319e9b2d713724ae527742af25 2018-06-13 16:57:05 76 1 214
96477 ffffa3172527f765de70084a7e53aae8 2017-09-02 11:53:32 361 2 45

96478 rows × 5 columns

2.2 Customer Satisfaction Metrics¶

  • Objective: Analyze customer satisfaction based on review data.
    • Average Review Score: Overall satisfaction level.
    • Review Count: Number of reviews submitted.
    • Negative Reviews: Count of reviews with scores ≤ 2.
  • Insights:
    • Helps identify dissatisfied customers and areas for improvement.
In [13]:
query_satisfaction = """
WITH customers_with_negative AS (
    SELECT DISTINCT o.customer_id, 1 as has_negative_flag
    FROM orders o
    JOIN order_reviews r ON o.order_id = r.order_id
    WHERE r.review_score <= 2
)

SELECT
    o.customer_id,
    CAST(AVG(r.review_score) AS INTEGER) as avg_review_score,
    COUNT(r.review_id) as review_count,
    COALESCE(cwn.has_negative_flag, 0) as has_negative_flag
FROM orders o
JOIN order_reviews r ON o.order_id = r.order_id
LEFT JOIN customers_with_negative cwn ON o.customer_id = cwn.customer_id
GROUP BY o.customer_id;
"""
df_satisfaction = db.execute_query(query_satisfaction)
display(df_satisfaction)
customer_id avg_review_score review_count has_negative_flag
0 00012a2ce6f8dcda20d059ce98491703 1 1 1
1 000161a058600d5901f007fab4c27140 4 1 0
2 0001fd6190edaaf884bcaf3d49edf079 5 1 0
3 0002414f95344307404f0ace7a26f1d5 5 1 0
4 000379cdec625522490c315e70c7a9fb 4 1 0
... ... ... ... ...
98668 fffecc9f79fd8c764f843e9951b11341 3 1 0
98669 fffeda5b6d849fbd39689bb92087f431 4 1 0
98670 ffff42319e9b2d713724ae527742af25 5 1 0
98671 ffffa3172527f765de70084a7e53aae8 5 1 0
98672 ffffe8b65bbe3087b653a978c870db99 5 1 0

98673 rows × 4 columns

2.3 Customer Behavior Patterns¶

  • Objective: Extract behavioral insights from customer purchase data.
    • Product Category Diversity: Number of unique product categories purchased.
    • Average Delivery Time: Time taken for orders to be delivered.
    • Unique Sellers: Number of sellers a customer has interacted with.
  • Insights:
    • Provides a deeper understanding of customer preferences and habits.
In [14]:
query_behavior = """
SELECT 
    o.customer_id,
    COUNT( p.product_category_name) as unique_categories,
    AVG(CAST(JULIANDAY(order_delivered_customer_date) - 
        JULIANDAY(order_purchase_timestamp) AS INTEGER)) as avg_delivery_time,
    COUNT( oi.seller_id) as unique_sellers
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_status = 'delivered'
GROUP BY o.customer_id;
"""
df_behavior= db.execute_query(query_behavior)
display(df_behavior)
customer_id unique_categories avg_delivery_time unique_sellers
0 00012a2ce6f8dcda20d059ce98491703 1 13.0 1
1 000161a058600d5901f007fab4c27140 1 9.0 1
2 0001fd6190edaaf884bcaf3d49edf079 1 5.0 1
3 0002414f95344307404f0ace7a26f1d5 1 28.0 1
4 000379cdec625522490c315e70c7a9fb 1 11.0 1
... ... ... ... ...
96473 fffcb937e9dd47a13f05ecb8290f4d3e 1 4.0 1
96474 fffecc9f79fd8c764f843e9951b11341 1 12.0 1
96475 fffeda5b6d849fbd39689bb92087f431 1 17.0 1
96476 ffff42319e9b2d713724ae527742af25 1 5.0 1
96477 ffffa3172527f765de70084a7e53aae8 2 12.0 2

96478 rows × 4 columns

3. Feature Analysis and Transformation¶

3.1 Merge Features¶

  • Objective: Combine RFM metrics, satisfaction metrics, and behavior patterns into a unified dataset for analysis.
  • Outcome:
    • A consolidated DataFrame ready for further analysis and modeling.
In [15]:
from src.classes.feature_analysis import FeatureAnalysis
from src.classes.feature_engineering import FeatureEngineering

# Initialize with existing query results
fe = FeatureEngineering(df_rfm, df_satisfaction, df_behavior)
features_df = fe.combine_features()



# Plot results
fa = FeatureAnalysis(features_df, columns_to_exclude=['order_id', 'customer_id'])
dist_plot = fa.plot_distributions()
dist_plot.show()

3.2 Feature Distribution Analysis¶

  • Objective: Explore the distribution of features to identify patterns, outliers, and preprocessing needs.
  • Steps:
    1. Statistical Summaries: Compute mean, median, standard deviation, etc.
    2. Distribution Plots: Visualize feature distributions using histograms or density plots.
    3. Outlier Detection: Use box plots to identify extreme values.
  • Outcome:
    • Informs decisions on scaling, normalization, and handling outliers.
In [16]:
from src.scripts.analyze_numerical_outliers import analyze_outliers_with_multiple_methods

# Analyze all numeric variables with different outlier detection methods
all_summaries, all_cleaned_dfs = analyze_outliers_with_multiple_methods(features_df)

3.3 Box Plot Analysis¶

  • Objective: Visualize feature distributions and detect outliers.
  • Steps:
    1. Plot box plots for numerical features.
    2. Highlight features with significant outliers.
  • Outcome:
    • Guides feature scaling and normalization.
In [17]:
from src.scripts.visualize_numerical_outliers import create_interactive_outlier_visualization

# Create the interactive outlier visualization
summary_df, df_cleaned = create_interactive_outlier_visualization(all_cleaned_dfs['Z-score (±3)'])
Data Summary:
Column Skewness Mean Median StdDev
0 recency_days 0.431893 239.116453 220.0 152.830149
1 frequency 3.135869 1.078442 1.0 0.268868
2 monetary 2.327219 138.188126 103.0 119.647027
3 avg_review_score -1.481654 4.155585 5.0 1.284016
4 review_count 0.000000 1.000000 1.0 0.000000
5 has_negative_flag 2.220768 0.128464 0.0 0.334608
6 unique_categories 1.792656 1.063158 1.0 0.295683
7 avg_delivery_time 1.217369 11.374220 10.0 7.324842
8 unique_sellers 3.135869 1.078442 1.0 0.268868

3.4 RFM dashboard review after normalization¶

  • Objective: Visualize RFM feature distributions after removal of outliers.
In [18]:
# Create a hybrid dataframe with cleaned continuous variables but original flags/counts
hybrid_df = all_cleaned_dfs['Z-score (±3)'].copy()

# List of columns to restore from original data
categorical_cols = ['has_negative_flag'] 
count_cols = ['review_count', 'unique_categories', 'frequency', 'unique_sellers']

# Replace the values with originals
for col in categorical_cols + count_cols:
    if col in features_df.columns and col in hybrid_df.columns:
        hybrid_df[col] = features_df[col]

# Use the hybrid dataframe for visualization
fa = FeatureAnalysis(hybrid_df, columns_to_exclude=['customer_id'])
dist_plot = fa.plot_distributions()
dist_plot.show()

3.3 Feature Correlations¶

  • Objective: Analyze relationships between features using a correlation matrix.
  • Steps:
    1. Compute the correlation matrix for numerical features.
    2. Visualize the lower triangle of the matrix using a heatmap.
  • Outcome:
    • Identifies highly correlated features and potential redundancies.
In [19]:
# Plot results
corr_plot = fa.plot_correlation_matrix()
corr_plot.show()

4. Implement clustering algorithms¶

4.1 Feature Transformations for Clustering¶

The feature transformation process implements a sophisticated approach to prepare customer data for clustering, ensuring optimal algorithm performance while maintaining interpretability:

Transformation Components¶

  1. Data Preparation

    • Set customer_id as the index to maintain customer identity throughout the analysis
    • Removed temporal variables (order_purchase_timestamp) that aren't relevant for segmentation
    • Created a clean foundation for applying consistent transformations
  2. Specialized Transformation Strategy

    • Implemented the GenericFeatureTransformer to handle different variable types appropriately
    • Applied specific transformation methods to three distinct variable categories:
      • Numerical variables: Applied robust scaling to minimize outlier impact
      • Categorical indicators: Properly encoded the has_negative_flag variable
      • Count variables: Applied special handling to preserve the count nature while standardizing scale
  3. Count Variable Preservation

    • Recognized that count variables (review_count, unique_categories, frequency, unique_sellers) have special properties
    • Implemented specialized transformations that maintain their inherent characteristics
    • Ensured that zero counts remain meaningful in the transformed space
  4. Reversible Transformations

    • Stored transformation parameters to enable inverse transformation
    • Created capability to convert clusters back to original feature space
    • Ensured business stakeholders can understand results in familiar metrics

Outcome¶

The transformation process creates a dataset that:

  • Places all variables on comparable scales for distance-based clustering algorithms
  • Preserves the inherent structure and relationships between variables
  • Minimizes the impact of outliers without losing their information
  • Maintains the ability to interpret results in the original business context

This approach balances mathematical optimization with business interpretability, enabling both effective clustering and actionable insights from the resulting segments.

In [20]:
from src.classes.feature_transformation import GenericFeatureTransformer

#Set customer_id as index before transformation
hybrid_df = hybrid_df.set_index('customer_id')
hybrid_df = hybrid_df.drop(columns=['order_purchase_timestamp'], errors='ignore')

# Create and apply the generic transformer
transformer = GenericFeatureTransformer()
transformed_df = transformer.fit_transform(
    df=hybrid_df,
    categorical_cols=['has_negative_flag'],
    count_cols=['review_count', 'unique_categories', 'frequency', 'unique_sellers']
)

# Get original data back when needed
original_df = transformer.inverse_transform(transformed_df)

4.2 Clustering Analysis with Advanced Feature Importance¶

The customer segmentation implementation uses a sophisticated dual-model approach combining PCA transformation with robust feature importance analysis:

Analysis Components¶

  1. PCA Component Analysis

    • Applied dimensionality reduction to visualize high-dimensional customer data
    • Created explained variance plots to identify optimal components (87% variance captured in top 3 components)
    • Generated biplots showing feature loadings on principal components for interpretability
    • Identified which features contribute most to each principal component
  2. Optimal Cluster Determination

    • Implemented optimized elbow method with silhouette scoring
    • Analyzed inertia and silhouette metrics to identify the ideal number of clusters
    • Selected 3 clusters based on the balance between complexity and interpretability
  3. Dual-Model K-Means Implementation

    • Trained PCA-based model for visualization (using 3D PCA components)
    • Trained feature-space model for interpretability (using original features)
    • Maintained both models to maximize insights while enabling visualization
  4. Feature Importance Analysis

    • Applied permutation importance to identify cluster-defining features
    • Used silhouette score reduction as the importance metric
    • Visualized with error bars to show consistency of importance scores
    • Sampling techniques applied for computational efficiency

Outcome¶

The analysis reveals distinct customer segments with clear behavioral differences, enabling targeted marketing strategies based on:

  • Recency, frequency, and monetary (RFM) profile of each segment
  • Satisfaction levels and complaint patterns
  • Product category diversity and seller engagement patterns

The implementation balances computation speed and insight depth, providing actionable customer segmentation with clear understanding of which features define each segment.

In [21]:
import importlib
import src.classes.cluster_analysis
importlib.reload(src.classes.cluster_analysis)
from src.classes.cluster_analysis import ClusteringAnalysis

# Initialize clustering with the transformer for inverse transform capability
ca = ClusteringAnalysis(
    df=transformed_df, 
    transformer=transformer
)
In [22]:
# Generate elbow plot to determine optimal number of clusters
elbow_plot = ca.plot_elbow(range(2, 11))
elbow_plot.show()
Explained variance ratio of PCA components: [4.46357323e-01 2.84277328e-01 1.40936669e-01 9.65714903e-02
 1.89670526e-02 1.15127137e-02 1.16503765e-03 2.12385035e-04
 0.00000000e+00 0.00000000e+00]
Computing clusters:   0%|                                                                                                                                                                           | 0/9 [00:00<?, ?it/s]
Computing clusters:  11%|██████████████████                                                                                                                                                 | 1/9 [00:05<00:41,  5.21s/it]
Computing clusters:  22%|████████████████████████████████████▏                                                                                                                              | 2/9 [00:09<00:34,  4.92s/it]
Computing clusters:  33%|██████████████████████████████████████████████████████▎                                                                                                            | 3/9 [00:14<00:28,  4.81s/it]
Computing clusters:  44%|████████████████████████████████████████████████████████████████████████▍                                                                                          | 4/9 [00:19<00:23,  4.69s/it]
Computing clusters:  56%|██████████████████████████████████████████████████████████████████████████████████████████▌                                                                        | 5/9 [00:23<00:18,  4.59s/it]
Computing clusters:  67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                      | 6/9 [00:27<00:13,  4.48s/it]
Computing clusters:  78%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                    | 7/9 [00:31<00:08,  4.36s/it]
Computing clusters:  89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                  | 8/9 [00:35<00:04,  4.26s/it]
Computing clusters: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:40<00:00,  4.20s/it]
Computing clusters: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:40<00:00,  4.45s/it]

In [23]:
# Visualize PCA explained variance to select optimal components
fig_pca_variance = ca.plot_pca_explained_variance(max_components=15)
fig_pca_variance.show()
In [24]:
# Create PCA biplot to visualize feature loadings on principal components
fig_biplot = ca.plot_pca_biplot(n_features=5)
fig_biplot.show()
In [25]:
# Visualize feature importance in principal components
fig_pca_importance = ca.plot_pca_feature_importance(n_components=4)
fig_pca_importance.show()
In [26]:
# Run clustering with a different number of clusters (e.g., 5)
ca.fit_kmeans(n_clusters=4)

# Create and show silhouette visualization to evaluate cluster quality
silhouette_fig = ca.plot_silhouette_plotly()
silhouette_fig.show()
Sampling 20,000 records from 96478 for faster silhouette calculation...
In [27]:
# Create and display circle-based intercluster visualization
intercluster_fig = ca.plot_intercluster_distance_circles()
intercluster_fig.show()

Understanding the Clustering Approach¶

The clustering implementation uses a sophisticated dual-model approach:

  1. PCA-based Clustering Model:

    • Used for the elbow method and visualizations
    • Applied to dimensionally-reduced data (PCA components)
    • Advantages: Better visualization, handles correlated features
  2. Original Feature Space Clustering Model:

    • Used for feature importance analysis
    • Applied directly to transformed features (not PCA)
    • Advantages: Direct interpretability of feature effects

This dual approach explains why:

  • The elbow plot may suggest different optimal clusters than silhouette analysis
  • Feature importance relates to the original features, not PCA components
  • We can both visualize clusters effectively and understand feature contributions## Understanding the Clustering Approach

The clustering implementation uses a sophisticated dual-model approach:

  1. PCA-based Clustering Model:

    • Used for the elbow method and visualizations
    • Applied to dimensionally-reduced data (PCA components)
    • Advantages: Better visualization, handles correlated features
  2. Original Feature Space Clustering Model:

    • Used for feature importance analysis
    • Applied directly to transformed features (not PCA)
    • Advantages: Direct interpretability of feature effects

This dual approach explains why:

  • The elbow plot may suggest different optimal clusters than silhouette analysis
  • Feature importance relates to the original features, not PCA components
  • We can both visualize clusters effectively and understand feature contributions
In [28]:
# Define your custom segment configuration - adjust thresholds based on business requirements
custom_segment_config = {
    # Recency thresholds (in days)
    'recency': {
        'active': 45,      # 0-45 days: "Active" (adjusted from default 30)
        'recent': 120,     # 46-120 days: "Recent" (adjusted from default 90)
        # > 120 days: "Inactive"
    },
    
    # Frequency thresholds (number of orders)
    'frequency': {
        'frequent': 2.5,   # > 2.5 orders: "Frequent" (adjusted from default 3)
        'returning': 1.2,  # > 1.2 orders: "Returning" (adjusted from default 1.5)
        # <= 1.2 orders: "One-time"
    },
    
    # Monetary thresholds
    'monetary': {
        'high_value': 150,  # > $150: "High-value" (fixed value instead of default 'mean')
        # <= $150: "Standard-value" 
    },
    
    # Satisfaction thresholds (review scores)
    'satisfaction': {
        'very_satisfied': 4.7,  # >= 4.7: "Very Satisfied" (adjusted from default 4.5)
        'satisfied': 4.2,       # >= 4.2: "Satisfied" (adjusted from default 4.0)
        'neutral': 3.5,         # >= 3.5: "Neutral" (adjusted from default 3.0)
        # < 3.5: "Unsatisfied"
    }
}


from src.scripts.cluster_dashboard import create_sales_dashboard
import numpy as np

# Store the original dataframe in the transformer for reference
transformer.original_df = hybrid_df

# Get the cluster labels from the clustering model
labels = ca.kmeans_results['labels']
print(f"Found {len(labels)} labels across {len(np.unique(labels))} clusters")

# Create the dashboard with custom segment configuration
dashboard = create_sales_dashboard(labels, transformed_df, transformer, custom_segment_config)

# Display the components
for name, fig in dashboard.items():
    fig.show()
Found 96478 labels across 4 clusters

5. Evaluate Segment Stability¶

  • Objective: Assess the stability of customer segments over time.
  • Steps:
    1. Validate feature consistency across time periods.
    2. Compute stability scores for each segment.
    3. Visualize stability trends.
  • Outcome:
    • Ensures segments remain meaningful and actionable over time.

6. Maintenance Recommendations¶